Abstract


The objective of this study is to investigate the efficient determination of $C$
and $\gamma$ for Support Vector Regression with an RBF or Mahalanobis kernel, based
on numerical and statistical considerations. The analysis indicates the connection
between $C$ and the kernel and demonstrates that the deviation of the geometric
distances between neighbouring observations in the mapped space affects the
prediction accuracy of $\varepsilon$-SVR. We determine the ranges of $\gamma$ and $C$
and propose a method to choose their best values.


Introduction 

Traditional forecasting algorithms such as ARIMA and exponential smoothing can
provide good forecasting results with regard to trend, seasonality, and other
linearly correlated features. In practice, those features are normally non-linear.
To solve non-linear forecasting problems, $\varepsilon$-Support Vector Regression
($\varepsilon$-SVR) is employed. The Support Vector Machine (SVM) is nowadays widely
used for classification problems in many areas. However, $\varepsilon$-SVR is hardly
used because of the uncertain parameters $C$ and $\varepsilon$ of its dual problem.
With RBF or Mahalanobis kernels, the value of $\gamma$ determines the kernel matrix
and is hence the key to the whole system. An overview of how to choose those
parameters is given in [1]. A method for selecting the best $C$ within a limited
number of iterations is given in [6]. The most commonly used methods are search
methods such as random search, grid search, and pattern search. Cherkassky and Ma
(2004) have proposed a way to determine those parameters directly from the data,
but one parameter, c, still needs to be searched separately within a pre-existing
range. Our proposal is also to determine $C$ and $\gamma$ directly, but without
using any search method. In this paper, we give a short overview of
$\varepsilon$-SVR in Section 1 and discuss the determination of $\gamma$ and $C$ in
Sections 2 and 3, respectively. We then test our algorithms on real-world data in
Section 4 and summarize the results in Section 5.

1 $\varepsilon$-Support Vector Regression

Assume $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n), \dots \in X \times \mathbb{R} \subset \mathbb{R}^{n+1}$
are observation pairs, where $x_i \in X \subset \mathbb{R}^n$ is the feature vector and
$y_i \in \mathbb{R}$ is the target output. Define $N_x$ as the total number of
observations in the training set. According to [3], the dual problem of
$\varepsilon$-SVR for a given $C$ and kernel $K$ is

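$$\min_{\alpha,\alpha^*}\ \frac{1}{2}(\alpha-\alpha^*)^T Q\,(\alpha-\alpha^*) + \varepsilon\sum_{i=1}^{N_s}(\alpha_i+\alpha_i^*) + \sum_{i=1}^{N_s} y_i(\alpha_i-\alpha_i^*) \qquad (1)$$

subject to $e^T(\alpha-\alpha^*) = 0$,

$0 \le \alpha_i, \alpha_i^* \le C$, $i = 1, \dots, N_s$,

where $Q_{i,j} := K(x_i, x_j)$, $\alpha = (\alpha_1, \dots, \alpha_{N_s})$,
$\alpha^* = (\alpha_1^*, \dots, \alpha_{N_s}^*)$, and $e$ denotes the vector of all ones.
The corresponding approximating function is

$$f(x) = \sum_{i=1}^{N_s}(-\alpha_i+\alpha_i^*)\,K(x_i,x) + b, \qquad (2)$$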

where $N_s$ is the total number of support vectors, which results from the
input parameter $\varepsilon$. We regard $\varepsilon$ as an accuracy indicator,
that is, the acceptable tolerance of the system. In practice, we use the
naturally defined tolerance as the input value for $\varepsilon$.
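The prediction function (2) can be sketched in a few lines of Python. This is a minimal illustration, not the solver used in the paper: the function names `rbf_kernel` and `svr_predict` and the toy coefficients are assumptions made here, and the coefficients are hand-picked rather than obtained by solving the dual problem (1).

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, alpha, alpha_star, b, gamma):
    """Evaluate f(x) = sum_i (-alpha_i + alpha_i^*) K(x_i, x) + b, as in (2)."""
    return b + sum(
        (-a + a_s) * rbf_kernel(sv, x, gamma)
        for sv, a, a_s in zip(support_vectors, alpha, alpha_star)
    )

# Toy coefficients chosen by hand (hypothetical); they satisfy e^T(alpha - alpha^*) = 0.
svs = [(0.0,), (1.0,)]
value = svr_predict((0.0,), svs, alpha=[0.0, 0.5], alpha_star=[0.5, 0.0], b=0.1, gamma=1.0)
print(value)
```

Only observations with a non-zero coefficient difference contribute to the sum, which is why the prediction depends on the support vectors alone.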


2 Optimal $\gamma$

A feature map $\phi: x \mapsto \phi(x) \in \mathbb{R}^m$, $m > n$, induces a new norm with
respect to the kernel $K$:

$$\|\phi(x_i)-\phi(x_j)\|^2 = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)$$

$$= 2 - 2K(x_i,x_j) \quad \text{for the RBF kernel},\ \forall x_i, x_j \in X \subset \mathbb{R}^n.$$
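The identity above can be checked numerically. The following is a small sketch under the assumption of an RBF kernel; the helper names `rbf_kernel` and `mapped_sq_distance` are illustrative only.

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def mapped_sq_distance(u, v, gamma):
    """Squared distance in the mapped space: K(u,u) + K(v,v) - 2 K(u,v)."""
    return rbf_kernel(u, u, gamma) + rbf_kernel(v, v, gamma) - 2.0 * rbf_kernel(u, v, gamma)

u, v, gamma = (0.0, 1.0), (2.0, 0.5), 0.3
lhs = mapped_sq_distance(u, v, gamma)
rhs = 2.0 - 2.0 * rbf_kernel(u, v, gamma)  # shortcut valid because K(x, x) = 1 for RBF
print(lhs, rhs)
```

Because $K(x,x) = 1$ for the RBF kernel, the squared mapped distance always lies in $[0, 2)$.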


In the mapped space, neighbouring observations should be well separated, which
means that the deviation of the mapped features should be large. Small values
of the deviation mean that the mapped features differ very little, which in
practice leads to linear dependences in the kernel matrix, produces few
independent features, and enlarges the solution space. It is not difficult to
show that the features mapped by the RBF kernel have the same structure as the
original ones, that is,

$$\|x_i - x_j\| < \|x_k - x_l\| \ \Longrightarrow\ \|\phi(x_i)-\phi(x_j)\| < \|\phi(x_k)-\phi(x_l)\|.$$

Greater independence of the mapped features leads to smaller solution spaces.
One measure of this property is the deviation of the 2-norms between every
pair of observations. Define the deviation function of the observations as

$$L(\gamma) := \frac{1}{N_x(N_x-1)}\sum_{(i,j)\in X}\left(\|\phi(x_i)-\phi(x_j)\| - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}\|\phi(x_k)-\phi(x_l)\|\right)^2,$$

where the sums run over all ordered pairs of distinct observations.
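A direct way to evaluate the deviation function is to compute all pairwise distances in the mapped space and take their variance. The sketch below assumes the RBF kernel and assumes that the deviation is the (population) variance of the pairwise 2-norms over ordered pairs of distinct observations; the names `mapped_distance` and `deviation` are illustrative.

```python
import math
from itertools import permutations

def mapped_distance(u, v, gamma):
    """||phi(u) - phi(v)|| = sqrt(2 - 2 exp(-gamma ||u - v||^2)) for the RBF kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(2.0 - 2.0 * math.exp(-gamma * sq_dist))

def deviation(points, gamma):
    """Variance of the mapped distances over all ordered pairs (i, j), i != j."""
    dists = [mapped_distance(u, v, gamma) for u, v in permutations(points, 2)]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)

pts = [(0.0,), (1.0,), (2.0,)]
for g in (0.0, 1.0, 1000.0):
    print(g, deviation(pts, g))
```

For very small $\gamma$ all mapped distances collapse towards 0, and for very large $\gamma$ they all approach $\sqrt{2}$, so the deviation vanishes at both ends and is positive in between.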


The deviation function $L(\gamma)$ describes the geometric differences among all
observations in the mapped space by means of the 2-norm and is convex. Our best
choice $\gamma_{opt}$ is the point at which $L(\gamma)$ reaches its maximum, where
$L'(\gamma_{opt}) = 0$. The first derivative is

$$L'(\gamma) = \frac{2}{N_x(N_x-1)}\sum_{(i,j)\in X}\left(L(x_i,x_j) - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}L(x_k,x_l)\right)\left(D(x_i,x_j) - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}D(x_k,x_l)\right)$$

and the second derivative is

$$L''(\gamma) = \frac{2}{N_x(N_x-1)}\sum_{(i,j)\in X}\left[\left(D(x_i,x_j) - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}D(x_k,x_l)\right)^2 + \left(L(x_i,x_j) - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}L(x_k,x_l)\right)\left(D'(x_i,x_j) - \frac{1}{N_x(N_x-1)}\sum_{(k,l)\in X}D'(x_k,x_l)\right)\right],$$

where

$$L(x_i,x_j) = \big(2 - 2\exp(-\gamma G(x_i,x_j))\big)^{1/2},$$

$$D(x_i,x_j) = G(x_i,x_j)\exp(-\gamma G(x_i,x_j))\big(2 - 2\exp(-\gamma G(x_i,x_j))\big)^{-1/2},$$

$$D'(x_i,x_j) = -\big(G(x_i,x_j)\big)^2\exp(-\gamma G(x_i,x_j))\big(2 - 2\exp(-\gamma G(x_i,x_j))\big)^{-1/2}\cdot\Big(1 + \exp(-\gamma G(x_i,x_j))\big(2 - 2\exp(-\gamma G(x_i,x_j))\big)^{-1}\Big)$$

$$= -G(x_i,x_j)\,D(x_i,x_j)\Big(1 + \exp(-\gamma G(x_i,x_j))\big(L(x_i,x_j)\big)^{-2}\Big).$$


For the RBF kernel $K(x_i,x) = \exp(-\gamma\|x_i - x\|^2)$:

$$G(x_i,x_j) = \|x_i - x_j\|^2.$$

For the Mahalanobis kernel $K(x_i,x) = \exp\big(-\frac{\gamma}{m}(x_i - x)^T Q^{-1}(x_i - x)\big)$:

$$G(x_i,x_j) = \frac{1}{m}(x_i - x_j)^T Q^{-1}(x_i - x_j),$$

with

$$Q = \frac{1}{N_x}\sum_{k=1}^{N_x}(x_k - c)(x_k - c)^T,$$

$$c = \frac{1}{N_x}\sum_{k=1}^{N_x}x_k,$$

$$m = \frac{1}{N_x}\sum_{k=1}^{N_x}(x_k - c)^T Q^{-1}(x_k - c).$$
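The closed forms of the pairwise quantities can be cross-checked against finite differences. The sketch below treats $G$ as a given non-negative number and assumes the pairwise distance is $(2 - 2\exp(-\gamma G))^{1/2}$ with first and second derivatives $D$ and $D'$ with respect to $\gamma$; the function names are illustrative.

```python
import math

def pair_distance(gamma, g):
    """L(x_i, x_j) = sqrt(2 - 2 exp(-gamma G))."""
    return math.sqrt(2.0 - 2.0 * math.exp(-gamma * g))

def pair_derivative(gamma, g):
    """D(x_i, x_j) = G exp(-gamma G) (2 - 2 exp(-gamma G))^(-1/2)."""
    e = math.exp(-gamma * g)
    return g * e / math.sqrt(2.0 - 2.0 * e)

def pair_second_derivative(gamma, g):
    """D'(x_i, x_j) = -G D (1 + exp(-gamma G) / L^2)."""
    e = math.exp(-gamma * g)
    return -g * pair_derivative(gamma, g) * (1.0 + e / (2.0 - 2.0 * e))

def central_difference(func, gamma, g, h=1e-6):
    """Numerical derivative of func with respect to gamma."""
    return (func(gamma + h, g) - func(gamma - h, g)) / (2.0 * h)

gamma, g = 0.5, 2.0
print(pair_derivative(gamma, g), central_difference(pair_distance, gamma, g))
print(pair_second_derivative(gamma, g), central_difference(pair_derivative, gamma, g))
```

Such a check is cheap and useful before plugging the analytic derivatives into a Newton-type iteration.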

Because of the convexity of the deviation function, its maximum always exists,
and a numerical method such as Newton's method gives a very good balance
between computational performance and model accuracy. The proposed deviation
function $L(\gamma)$ is also very helpful for determining the range for search
methods (see Figure 1). The range of $\gamma$ is $(0, \gamma_h)$, where
$\gamma_h = \arg\min L(\gamma)$. At the point $\gamma_h$, the value of the kernel
function is $K(x_i,x_j) \approx 0$ for $i \neq j$, which leads to a strong
over-fitting problem (see Figure 6). The best value of $\gamma$ is at
$\arg\max L(\gamma)$.

Figure 1: Deviation function $L(\gamma)$ for the RBF kernel (horizontal axis: gamma values)
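The selection of $\gamma$ at the maximum of the deviation function can be illustrated with a simple scan. Note that this sketch swaps in a coarse scan over a logarithmic grid for the Newton iteration mentioned in the text, and it assumes the deviation is the variance of the pairwise mapped distances under the RBF kernel; the grid, the toy points, and the function names are hypothetical.

```python
import math
from itertools import permutations

def mapped_distance(u, v, gamma):
    """||phi(u) - phi(v)|| for the RBF kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.sqrt(2.0 - 2.0 * math.exp(-gamma * sq_dist))

def deviation(points, gamma):
    """Variance of the mapped distances over all ordered pairs of distinct points."""
    dists = [mapped_distance(u, v, gamma) for u, v in permutations(points, 2)]
    mean = sum(dists) / len(dists)
    return sum((d - mean) ** 2 for d in dists) / len(dists)

def argmax_deviation(points, gammas):
    """Return the candidate gamma with the largest deviation (grid scan, not Newton)."""
    return max(gammas, key=lambda g: deviation(points, g))

pts = [(0.0,), (1.0,), (2.0,), (4.0,)]
grid = [2.0 ** k for k in range(-10, 11)]
best = argmax_deviation(pts, grid)
print(best, deviation(pts, best))
```

A scan like this can also provide a starting point for a derivative-based refinement.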


3 Optimal choice of C 

We use the method proposed in [6] to obtain the best value of $C$, which provides
a very stable solution. In order to reduce the total number of iterations, we
supply a reasonable initial $C$ before the iteration begins by applying the mean
value theorem in the mapped space. From (2), there exists $\tilde{x} \in X$ satisfying

$$\|y_i - y_j\| = \|f'(\phi(\tilde{x}))\cdot(\phi(x_i) - \phi(x_j))\|, \quad \forall i,j \in \{1,\dots,N_x\}. \qquad (3)$$

For the RBF kernel, we can prove that

$$\frac{\|y_i - y_j\|}{\|\phi(x_i) - \phi(x_j)\|} \le \|w\| \le N_s C.$$

In the worst case, where $N_s = 1$ and $\|\phi(x_i) - \phi(x_j)\| = \exp(-\gamma\|x_i - x_j\|^2)$,
we obtain

$$C \ge \max_{x_i,x_j}\|y_i - y_j\|\exp(\gamma\|x_i - x_j\|^2). \qquad (4)$$
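The right-hand side of (4) is a maximum over all pairs of observations and can be computed directly. The following sketch assumes scalar targets and feature vectors given as tuples; the function name `initial_c` and the toy data are illustrative.

```python
import math
from itertools import combinations

def initial_c(xs, ys, gamma):
    """max over pairs (i, j) of |y_i - y_j| * exp(gamma * ||x_i - x_j||^2)."""
    best = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
        best = max(best, abs(yi - yj) * math.exp(gamma * sq_dist))
    return best

xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.0, 1.0, 1.5]
c_ini = initial_c(xs, ys, gamma=0.5)
print(c_ini)
```

The cost is quadratic in the number of observations, and the exponential can overflow for large $\gamma$ or unscaled features, so scaling the data beforehand is advisable in practice.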
We use the right-hand side of (4) as the initial value $C_{ini}$ for the iteration
that finds the best $C$. After every SVR solve, the new value of $C$ is calculated by


r 

w new 


E L e (yi 

Xi&Xc 


N x + N s 


f(xi))+ E 


+ 


eNx 


eC+1 


(5) 


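with

$$X_C = \{x_i \mid |y_i - f(x_i)| > \varepsilon \text{ with } \alpha_i = C,\ \alpha_i^* = 0 \text{ or } \alpha_i = 0,\ \alpha_i^* = C\},$$

$$X_M = \{x_i \mid |y_i - f(x_i)| = \varepsilon \text{ with } 0 < \alpha_i < C \text{ or } 0 < \alpha_i^* < C\},$$

and the $\varepsilon$-loss function $L_\varepsilon$ is defined as

$$L_\varepsilon(y_i - f(x_i)) = \begin{cases} 0 & \text{for } |y_i - f(x_i)| < \varepsilon, \\ |y_i - f(x_i)| - \varepsilon & \text{otherwise.} \end{cases} \qquad (6)$$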

Figure 2: Deviation of $C_{new}$ with respect to different initial values $C_{ini}$

According to our experiments, the proposed initial value leads to a relatively
small deviation during the computation of $C_{new}$ (see Figure 2), which reduces
the number of iterations when the stopping criterion is the change $\Delta$ between
the new and the previous $C_{new}$. The experiments also show that, no matter how
large $C_{ini}$ is, $C$ reaches its stable value within a limited number of
iterations. Figure 3 and Table 1 show that, in our experiments with gas
consumption data using temperature and weekday as features, the value of $C$
changes very little after the second SVR solve.
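The stopping rule described above amounts to a fixed-point iteration on $C$. The sketch below is only a skeleton: `update` is a hypothetical placeholder for "solve the SVR with the current $C$ and compute $C_{new}$", and the toy update used in the example is an arbitrary contraction, not the update rule of the paper.

```python
def iterate_c(c_init, update, tol=1e-4, max_iter=50):
    """Repeat C <- update(C) until |C_new - C| < tol; return (C, number of updates)."""
    c = c_init
    for k in range(1, max_iter + 1):
        c_new = update(c)
        if abs(c_new - c) < tol:
            return c_new, k
        c = c_new
    return c, max_iter

# Toy stand-in for the SVR-based update (hypothetical): a contraction with fixed point 2.
c_final, n_updates = iterate_c(8.0, lambda c: 0.5 * c + 1.0)
print(c_final, n_updates)
```

A good initial value matters because each call to `update` would involve a full SVR solve.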



Figure 3: Iterations for finding a stable $C$ with different $C_{ini}$ (horizontal axis: iteration)


C_ini       1st       2nd       3rd       4th       5th       6th       7th       8th       9th       10th
2.582123    2.887401  2.895912  2.893142  2.903892  2.898490  2.898554  2.898552  2.898462  2.898543  2.893128
0.100000    2.738660  2.897665  2.898563  2.898521  2.898552  2.898515  2.898516  2.898517  2.898550  2.893649
1.000000    2.836104  2.893117  2.893175  2.903887  2.898516  2.898489  2.898544  2.898492  2.898560  2.898525
2.000000    2.866985  2.898480  2.898493  2.898526  2.898519  2.898527  2.898528  2.898566  2.898478  2.898450
4.000000    2.891488  2.898533  2.898536  2.898512  2.898500  2.898469  2.898476  2.898543  2.898502  2.898529
8.000000    2.892421  2.893184  2.898528  2.898532  2.898523  2.898495  2.898494  2.898525  2.898560  2.898519
16.000000   2.936898  2.893208  2.893217  2.893139  2.903910  2.898495  2.898543  2.898531  2.898510  2.898463
32.000000   2.977581  2.897040  2.898510  2.898527  2.898548  2.897689  2.898522  2.898533  2.898511  2.898530
64.000000   2.995011  2.905414  2.902729  2.898555  2.898485  2.898533  2.898483  2.898527  2.898541  2.898588
256.000000  3.090617  2.909661  2.898573  2.898489  2.898521  2.898504  2.898506  2.898570  2.898555  2.898496

Table 1: Iterations table ($C_{new}$ after the 1st to 10th iteration for each initial value $C_{ini}$)















4 Experiments 


All experiments were performed on real-world data from an energy company
covering 2009-01-01 to 2011-12-31, which contain 15 different features.
The training set covers 2009-01-01 to 2011-09-24. We determined the curve of
$L(\gamma)$ as described in Section 2 and used the $\gamma$ at the maximum of
$L(\gamma)$ for the determination of the parameter $C$. The back test was performed
with data from 2011-09-25 to 2011-12-31 for RBF kernels. The results were
compared with those generated by searching $\gamma \in \{2^{-15}, \dots, 2^{3}\}$
and $C \in \{2^{-5}, \dots, 2^{15}\}$. The R package "e1071" was used. We scaled
the data into the range $[0,1]$ and used the default scale option for further
calculation. We employed the root-mean-square error (RMSE)


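$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}\big(y_i - f(x_i)\big)^2}{N}}$$

and mean absolute percentage error (MAPE)

$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - f(x_i)}{y_i}\right|$$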

as measures of the accuracy of $\varepsilon$-SVR. Besides these two well-known
measures, we define a completely new measure, the System Error Deviation
Measure (SEDM),

as


SEDM 


N 


N 


2=1 


E ({Vi ~ f{xi)) ~ 


N 


N 

E 

3= 1 


(yj - f( x i)) 


(7) 
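For reference, the two standard accuracy measures used in this section, RMSE and MAPE, can be sketched as follows. The function names and the toy numbers are illustrative; MAPE is undefined when a target value is zero, which the sketch does not guard against.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt(mean((y_i - f(x_i))^2))."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; requires non-zero targets."""
    return 100.0 / len(y_true) * sum(abs((y - f) / y) for y, f in zip(y_true, y_pred))

y = [1.0, 2.0, 4.0]
f = [1.0, 1.0, 5.0]
print(rmse(y, f), mape(y, f))
```

RMSE is expressed in the units of the (scaled) target, while MAPE is relative and therefore easier to compare across data sets.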





                                      Training               Back test
Method          gamma     C           RMSE         MAPE      RMSE     MAPE
Our proposal    0.03224   0.42112     0.03808      13.5430%  0.05082  15.4615%
Tune method     0.04000   0.99000     0.03664      12.6075%  0.05262  15.7146%
Best solution   0.02      0.61        0.03758      13.6883%  0.05005  14.9604%
Over-fitting    96.27366  55.30339    5.06542e-05  0.0279%   0.16210  55.3389%


Table 2: Results with $\varepsilon = 0$ for Figure 4

Table 2 shows that our proposal produced better back-test results than the tune
method, which solved the SVR more than 10,000 times and has the smallest RMSE on
the training set. The $\gamma$ value of the best solution, which always has the
smallest RMSE in the back test, is smaller than that of our proposal. The
over-fitting setting is $\gamma = \arg\min L(\gamma)$. Figure 4 shows the whole
$\gamma$-$C$-RMSE space. Our proposal lies at the same level as the grid search and
the best solution.



Figure 4: Solution space with $\varepsilon = 0$ for Table 2

Table 3 shows the case $\varepsilon \neq 0$. We additionally tuned all three
parameters within $[10^{-10},1] \times [10^{-10},1] \times [0,1]$ for $\gamma$, $C$, and
$\varepsilon$, which required solving the SVR more than 1,000,000 times. For Extra
Tuning, we chose the setting with the smallest training RMSE, but its
forecasting error is worse than the results in Table 2. Therefore, we took
$\gamma_{best}$ from Best Solution and $\gamma_{opt}$ from Our Proposal in Table 2,
and for each of them tuned $C$ within $[10^{-10},1]$ and $\varepsilon$ within
$[0, 0.99]$. For $\gamma_{opt}$, our proposal with $\varepsilon = 0.00516$ produced
better results in Table 3 than the tune method, whereas the tune method
produced better results for $\gamma_{best}$. Figure 5 shows the locations of our
experimental solutions in the $C$-$\varepsilon$-RMSE space for $\gamma_{opt}$ and
$\gamma_{best}$.

                                                           Training           Back test
Method         Setting     gamma    C        epsilon  RMSE     MAPE      RMSE     MAPE
Our proposal   gamma_opt   0.03224  0.42112  0.00516  0.03808  13.5563%  0.05079  15.4560%
               gamma_best  0.02     0.22035  0.02768  0.03967  14.9641%  0.05038  15.6418%
Tune method    gamma_opt   0.03224  0.99     0.07     0.03635  14.0366%  0.05185  15.9645%
               gamma_best  0.02     0.99     0.05     0.03681  13.9567%  0.05013  14.7957%
Best solution  gamma_opt   0.03224  0.16     1e-10    0.04058  14.4724%  0.05026  15.5602%
               gamma_best  0.02     0.71     0.05     0.03725  14.1742%  0.04992  14.8701%
Extra tuning               0.06     0.99     0.09     0.03599  14.3002%  0.05584  18.0309%


Table 3: Results with optimal $\varepsilon$ for Figure 5




Figure 5: Solution space for Table 3. Panels: (a) $\gamma_{opt}$, (b) $\gamma_{best}$

Figure 6: Results generated by the R package "e1071". Panels: (a) $\gamma = \arg\max L(\gamma)$, (b) using the searching method, (c) $\gamma = \arg\min L(\gamma)$ (over-fitting problem)


5 Conclusion 

The deviation function $L(\gamma)$ and the set $C$ give the ranges of $\gamma$ and $C$
and also provide a way to choose their optimal values. With the searching
method, the RMSE on the training set became smaller than before.